Abstract

Improving the accuracy of speech recognition technology by adding
visual information is the key approach in multi-modal ASR research.
In this work, we address two important issues: lip tracking and the
visual speech feature extraction algorithm. In order to use
multi-modal ASR for natural speech, the visual front end algorithm
must extract visual speech features that are invariant to affine
transformations and lighting conditions.

This  paper  focuses  on  both  the  lip  tracking  algorithm  using 
the  Bayesian  framework  and  a novel  pixel  based  visual  speech 
feature  extraction  algorithm  based  on  kurtosis  measures  of  the 
frequency  profile  of  the  local  image  blocks.  We  compare  the 
results  of  the  proposed  features  with  the  results  of  outer  lip  con- 
tour based  affine-invariant  visual  features,  and  global  2D  DCT 
features.  Experimental  results  in  this  paper  are  presented  for 
a visual-only  connected  digit  recognition  task  for  performance 
comparison  of  the  visual  features. 

Keywords:  Lip  tracking,  Visual  feature  extraction,  Kur- 
tosis measure. 

1.  Introduction 

The addition of visual information to audio features improves speech
understanding and offers key advantages in human-computer interfaces,
especially in difficult environments [1-6]. Improving existing
state-of-the-art automatic speech recognition (ASR) performance by
integrating visual information from the speaker’s mouth region is
receiving significant attention from the speech recognition
communities.

Some of the initial difficulties associated with computer lipreading
(visual speech recognition) are accurate and consistent extraction of
the visual region of interest (ROI) and lip tracking on the fly, which
needs to be robust to a speaker’s ethnic and gender variability and to
other aspects of visual appearance such as glasses, facial hair, skin
color, lip color, and lip shape. Another difficulty is robust and
consistent visual speech feature extraction.

The development of a successful audio-visual speech recognition
technology capable of adapting itself to changing environments will
support both industrial and military applications. Audio-visual speech
recognition is a relatively new and advancing research area. A noise
robust audio-visual speech recognition system will facilitate the use
of computers, increase reliability and worker productivity, and make
communication between humans and computers more natural. In addition,
audio-visual speech recognition technology can facilitate new
commercial applications such as text-driven audio-visual talking
heads, audio-visual speech-to-speech translation, and speech-to-video
conversion for the hearing impaired.

In  our  earlier  research  [1,7],  we  have  implemented 
both  late  integration  and  early  (multi-stream  state  syn- 
chronous) integration  schemes  for  a controlled  audio-visual 
data  set.  For  both  integration  schemes,  the  experimental  re- 
sults showed  that  addition  of  visual  information  improves 
the  recognition  performance.  In  this  paper,  the  following 
objectives  will  be  sought: 

1.  Development  of  a lip  tracking  algorithm,  and 

2.  A novel  visual  speech  feature  extraction  algorithm 
that  satisfies  the  following  three  criteria: 

i.  Affine  (rotation,  scale,  and  shear)  invariance, 

ii.  Chrominance  space  shift  invariance,  and 

iii.  Chrominance  space  scale  invariance. 

In our proposed visual speech feature extraction method, criterion (i)
is satisfied by affine correction, criterion (ii) is satisfied by
removing the DC component of the 2D DCT coefficients, and criterion
(iii) is satisfied by using normalized higher order moments of the DCT
coefficients of the lip image blocks.

This  work  is  organized  as  follows.  In  section  2,  we 
present  a Bayesian  framework  for  lip  tracking,  parametric 
formulation  of  the  Gaussian  parameters  and  adaptation  of 
the  parameters  on  the  fly.  Section  3 discusses  the  removal  of 
affine  (rotation,  scale,  shear)  effects  from  the  segmented  lip 
image. In section 4, we discuss contour based affine invariant
features, pixel based normalized 2D DCT features, and
describe  a novel  visual  speech  feature  extraction  algorithm 
based  on  kurtosis  measures  of  the  frequency  profile  of  the 
local  image  blocks  of  the  mouth.  We  present  the  experimen- 
tal setup  and  the  results  in  Section  5.  Section  6 gives  the 
concluding  remarks  and  the  proposed  future  work. 

2.  Lip  Tracking  Using  the  Bayesian 
Framework 

The basis of the audio-visual speech recognition system is an
efficient lip tracking algorithm. Computational time constraints
imposed by applications such as audio-visual speech recognition and
animated talking head design contribute to the difficulty of the task.
Most lip tracking algorithms build upon an eigenspace based face
detector and an ensemble of feature detectors which are used to
extract pre-specified landmarks, such as nostrils and lip corners, to
locate the ROI (mouth region) [8,9]. Deformable template and snake
based methods [10, 11] have also been used for this task. All of these
techniques have reported good results, but their accuracy decreases
under occlusion (profile view), lighting condition changes, texture
changes, and quick motion. The technique we propose uses color images
with a Bayesian framework for classification, which requires the
estimation of the a priori probabilities and class conditional density
models. The class conditional density and a priori probability
estimation processes are described in the following sections.

In the lip tracking problem there are two distinct classes, lip and
non-lip. Therefore, this section discusses the two-class
classification problem: each sample in the image frame belongs either
to the lip class, $w_1$, or to the non-lip class, $w_2$. The
conditional density functions and the a priori probabilities are
estimated from training data. In practice, this may require an
extensive search to locate the lip and non-lip regions in the first
frame, which is not discussed here. The Bayes decision rule determines
whether an observation, $\mathbf{x}$, belongs to $w_1$ or $w_2$. The
Gaussian density function is one of the most commonly used probability
density functions in practice because of its computational simplicity
and because it models a large number of cases in nature. The Gaussian
parameters are estimated parametrically on the fly using information
from the previous frame, which leads to an adaptive real time lip
tracking and segmentation algorithm.

2.1.  Parametric  Formulation  of  Gaussian  Density  from 
Sample  Data 

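In the parametric formulation of the multivariate Gaussian density,
estimates of the mean vector and covariance matrix of each of the two
classes, $w_1$ and $w_2$, are required. Let $N$ be the number of
samples drawn from a class, $w_i$, with respect to $\mathbf{x}$ in the
n-dimensional feature space. Then the general multivariate Gaussian
(normal) density is given by

$$p(\mathbf{x} \mid w_i) = \frac{1}{(2\pi)^{n/2} \|\Sigma_i\|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x}-\mu_i)^T \Sigma_i^{-1} (\mathbf{x}-\mu_i)\right\}, \quad i = w_1, w_2 \tag{1}$$

where $\mu_i = E[\mathbf{x}]$ is the mean value of the class $w_i$, and
$\Sigma_i$ is the n x n covariance matrix defined as

$$\Sigma_i = E[(\mathbf{x}-\mu_i)(\mathbf{x}-\mu_i)^T]. \tag{2}$$

$\|\Sigma_i\|$ represents the determinant of $\Sigma_i$ and $E[\cdot]$
is the expected value of a random variable. The parameters $\mu_i$ and
$\Sigma_i$ can be estimated without bias by the sample mean and sample
covariance matrix as

$$\hat{\mu}_i = \frac{1}{N}\sum_{j=1}^{N} \mathbf{x}_j^{(i)}, \quad i = w_1, w_2 \tag{3}$$

$$\hat{\Sigma}_i = \frac{1}{N-1}\sum_{j=1}^{N} (\mathbf{x}_j^{(i)}-\hat{\mu}_i)(\mathbf{x}_j^{(i)}-\hat{\mu}_i)^T, \quad i = w_1, w_2 \tag{4}$$

where $\mathbf{x}_j^{(i)}$ is the jth sample vector from the ith class.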
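The estimation in Equations 3 and 4 and the evaluation of Equation 1 can be sketched as follows. This is a minimal illustration in Python with NumPy, not the implementation used in the paper; the function names `estimate_gaussian` and `log_gaussian_density` are hypothetical, and the density is returned in log form as an assumed convenience for numerical stability.

```python
import numpy as np

def estimate_gaussian(samples):
    """Sample mean (Eq. 3) and unbiased sample covariance (Eq. 4).

    samples: array of shape (N, n), one feature vector per row.
    """
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Dividing by N - 1 gives the unbiased covariance estimate.
    cov = centered.T @ centered / (samples.shape[0] - 1)
    return mean, cov

def log_gaussian_density(x, mean, cov):
    """Natural log of the multivariate Gaussian density of Eq. 1."""
    x = np.asarray(x, dtype=float)
    n = mean.shape[0]
    diff = x - mean
    # Solve a linear system instead of forming the inverse explicitly.
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + mahalanobis)
```

In a tracking setting, `samples` would hold the feature vectors of the pixels assigned to one class in the previous frame.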
2.1.1. Class Conditional Mixture Density Estimation

Given the data sets for the lip and non-lip classes from the previous
frame, we can form the class conditional mixture density function in
general as follows.

1. Form a 6-dimensional attribute data set for each class from color
and texture measures ($R$, $G$, $B$, $R_v$, $G_v$, $B_v$) for each
pixel location, and cluster it (possibly into three clusters for lip,
tongue, and teeth) using an unsupervised K-means clustering algorithm.

2. Form the parametric class conditional density models
$P(\mathbf{x} \mid w_L^m)$ for the lip class ($L$) using the method
described in Section 2.1 for each cluster, where $m$ represents the
cluster i.d.

3. Similarly, repeat steps 1 and 2 to form the parametric class
conditional density models $P(\mathbf{x} \mid w_{nL}^m)$ for non-lips
($nL$).

4. Form the conditional density mixture models using the weighted sum
of the conditional densities belonging to the clusters. That is,

$$P(\mathbf{x} \mid w_i) = \sum_{m=1}^{C} c_m P(\mathbf{x} \mid w_i^m), \quad i = L, nL \tag{5}$$

where $C$ is the number of clusters for the lip or non-lip class, and
$c_m = n_m/N$ is the mixture weight obtained by taking the ratio of the
number of pixels in cluster $m$ to the total number of pixels in that
class.
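The four steps above can be sketched for a single class as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the tiny K-means routine, its random initialization, and the names `kmeans`, `fit_mixture`, and `mixture_density` are all hypothetical, and every cluster is assumed to contain enough pixels for a non-singular covariance matrix.

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Very small K-means; returns one cluster label per sample."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for m in range(k):
            if np.any(labels == m):
                centers[m] = data[labels == m].mean(axis=0)
    return labels

def fit_mixture(data, labels):
    """Per-cluster Gaussians with weights c_m = n_m / N (Eq. 5)."""
    components = []
    for m in np.unique(labels):
        members = data[labels == m]
        weight = len(members) / len(data)
        components.append((weight, members.mean(axis=0),
                           np.cov(members, rowvar=False)))
    return components

def mixture_density(x, components):
    """Weighted sum of the cluster densities evaluated at x (Eq. 5)."""
    total = 0.0
    for weight, mean, cov in components:
        n = len(mean)
        diff = x - mean
        expo = -0.5 * diff @ np.linalg.solve(cov, diff)
        norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(cov))
        total += weight * np.exp(expo) / norm
    return total
```

The same functions would be called once with the lip pixels and once with the non-lip pixels to obtain the two class conditional mixture densities.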


2.1.2.  A Priori  Probability  Estimation 

As shown in Equation 10, a priori probability specification is an
important task for a Bayesian classifier since the threshold value of
the likelihood ratio is based on the a priori class probabilities. It
is desirable to obtain a speaker and time (frame) dependent Bayesian
parameter set in order to adapt to skin tone color variations and
lighting variations on the fly. The selection of the sample data used
to obtain the class mean vectors and covariance matrices has a direct
effect on the parametric representation of the class conditional
density models. Calculating the a priori class probabilities from the
number of pixels in each class would bias them toward the sample data,
so it would be a poor choice. From a careful examination of the
multivariate Gaussian density function in Equation 1, one intuitive
choice is to weight the a priori class probabilities by the
determinants of the class covariance matrices, as

$$p(w_i) = \frac{\|\Sigma_i\|}{\|\Sigma_1\| + \|\Sigma_2\|}, \quad i = w_1, w_2 \tag{6}$$

where $p(w_1) + p(w_2) = 1$. Figure 1 shows the class regions based on
the threshold value of the likelihood ratio (Bayes decision rule) and
the effect of a priori class probability selection.
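Equation 6 reduces to a few lines of code. The sketch below assumes the two covariance matrices have already been estimated; the function name `determinant_priors` is hypothetical.

```python
import numpy as np

def determinant_priors(cov_1, cov_2):
    """A priori class probabilities weighted by covariance determinants (Eq. 6)."""
    det_1 = np.linalg.det(cov_1)
    det_2 = np.linalg.det(cov_2)
    total = det_1 + det_2
    return det_1 / total, det_2 / total
```

For example, covariance determinants of 1 and 9 give priors of 0.1 and 0.9, so the class with the wider spread receives the larger a priori probability.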


2.2.  Bayesian  Decision  Rule 

Let $\mathbf{x}$ be an observation vector (a set of features belonging
to a pixel location in the image frame). Our goal is to design a Bayes
classifier to determine whether $\mathbf{x}$ belongs to $w_1$ or
$w_2$. The Bayes test using a posteriori probabilities may be written
as follows:

$$p(w_1 \mid \mathbf{x}) \underset{w_2}{\overset{w_1}{\gtrless}} p(w_2 \mid \mathbf{x}), \tag{7}$$


Figure 1: Bayes decision rule and the effect of the a priori class
probability values.


• Obtain $g_1(\mathbf{x})$ and $g_2(\mathbf{x})$ using Equation 11 for
every pixel in the image.

• Apply an averaging filter to $g_1(\mathbf{x})$ and
$g_2(\mathbf{x})$ to obtain $\bar{g}_1(\mathbf{x})$ and
$\bar{g}_2(\mathbf{x})$. The smoothing operation reduces the effect of
noise.

• Apply the Bayesian classification rule to every pixel in the image
frame to obtain binary lip candidate pixels, as

$$\bar{g}_1(\mathbf{x}) \underset{w_2}{\overset{w_1}{\gtrless}} \bar{g}_2(\mathbf{x}). \tag{14}$$


where $p(w_i \mid \mathbf{x})$ is the a posteriori probability of
$w_i$ given $\mathbf{x}$. Equation 7 shows that, if the probability of
$w_1$ given $\mathbf{x}$ is larger than the probability of $w_2$, then
$\mathbf{x}$ is declared to belong to $w_1$, and vice versa. Since
direct calculation of $p(w_i \mid \mathbf{x})$ is not practical, we can
re-write the a posteriori probability of $w_i$ using the Bayes theorem
in terms of the a priori probability and the conditional density
function $p(\mathbf{x} \mid w_i)$, as

$$p(w_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid w_i)\, p(w_i)}{p(\mathbf{x})} \tag{8}$$

where $p(\mathbf{x})$ is the mixture density function, which is
positive and constant for all classes. Then, the decision rule shown
in Equation 7 can be written as


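$$p(\mathbf{x} \mid w_1)\, p(w_1) \underset{w_2}{\overset{w_1}{\gtrless}} p(\mathbf{x} \mid w_2)\, p(w_2) \tag{9}$$

or re-arranging both sides, we get

$$L(\mathbf{x}) = \frac{p(\mathbf{x} \mid w_1)}{p(\mathbf{x} \mid w_2)} \underset{w_2}{\overset{w_1}{\gtrless}} \frac{p(w_2)}{p(w_1)} \tag{10}$$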


where $L(\mathbf{x})$ is called the likelihood ratio, and
$p(w_2)/p(w_1)$ is called the threshold value of the likelihood ratio
for the decision. As shown in Equation 10, a priori probability
specification is an important task for a Bayesian classifier. Because
of the exponential form of the densities involved in Equation 10, it
is preferable to work with monotonic functions called discriminant
functions, which are obtained by taking the logarithm of both sides of
Equation 9:

$$g_i(\mathbf{x}) = \ln\big(p(\mathbf{x} \mid w_i)\, p(w_i)\big), \quad \text{or} \tag{11}$$

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mu_i)^T \Sigma_i^{-1} (\mathbf{x}-\mu_i) + \ln p(w_i) + c_i \tag{12}$$

where $c_i = -(n/2)\ln 2\pi - (1/2)\ln\|\Sigma_i\|$ is a constant. In
general, Equation 12 has a nonlinear quadratic form. Using Equation
12, the Bayes rule can be written as follows, which is preferable for
computational speed:

$$g_1(\mathbf{x}) \underset{w_2}{\overset{w_1}{\gtrless}} g_2(\mathbf{x}). \tag{13}$$
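The discriminant functions of Equation 12 and the decision rule of Equation 13 can be sketched as below. This is a minimal sketch assuming one Gaussian per class, with each class described by a hypothetical `(mean, cov, prior)` tuple; the function names are illustrative and a mixture model would need the log of Equation 5 in place of the quadratic form.

```python
import numpy as np

def discriminant(x, mean, cov, prior):
    """g_i(x) of Eq. 12 for each row of x (shape (M, n))."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    n = mean.shape[0]
    diff = x - mean
    # Row-wise Mahalanobis distance (x - mu)^T Sigma^-1 (x - mu).
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    _, logdet = np.linalg.slogdet(cov)
    c = -0.5 * n * np.log(2.0 * np.pi) - 0.5 * logdet
    return -0.5 * maha + np.log(prior) + c

def classify_lip(x, lip, non_lip):
    """Eq. 13: True where g_1(x) > g_2(x), i.e. the pixel is labeled lip."""
    return discriminant(x, *lip) > discriminant(x, *non_lip)
```

Applied to the feature vectors of all pixels in a frame or ROI, `classify_lip` yields the binary lip candidate mask.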

2.3.  Lip  Tracking  Algorithm  and  ROI  Selection 

The Bayesian framework described in this paper utilizes
color  images  with  no  prior  labeling.  The  goal  is  to  segment 
the  lip  region  in  the  current  frame  and  select  the  ROI  for 
the  following  frame  to  limit  the  search  space.  The  basic  lip 
tracking  and  ROI  selection  procedures  are  described  below. 


• Segment the lip region in the binary image resulting from the Bayes
classifier (using heuristics such as the largest region between
nostrils and chin).

The Bayesian classifier is applied to the full image array for the
first frame. Once the lip region is detected in the current frame, the
search space for the next frame is bounded by a rectangular ROI,
obtained by enlarging the current lip region by 25% of its width and
height in the horizontal and vertical directions, respectively. The
Bayesian classifier is then applied only to the ROI in the next frame,
rather than to the full image array, which enables real time lip
tracking.
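The ROI update can be sketched as follows. The text does not say whether the 25% margin is added on each side or in total, so this sketch assumes it is added on each side; the function name, the box layout `(x_min, y_min, x_max, y_max)`, and the clipping to the frame are likewise assumptions.

```python
def next_frame_roi(lip_box, frame_size, margin=0.25):
    """Enlarge the current lip bounding box to get the next frame's ROI.

    lip_box: (x_min, y_min, x_max, y_max) in pixels.
    frame_size: (width, height) of the image frame.
    """
    x_min, y_min, x_max, y_max = lip_box
    frame_w, frame_h = frame_size
    dx = margin * (x_max - x_min)  # horizontal margin from the box width
    dy = margin * (y_max - y_min)  # vertical margin from the box height
    return (
        max(0, int(x_min - dx)),
        max(0, int(y_min - dy)),
        min(frame_w - 1, int(x_max + dx)),
        min(frame_h - 1, int(y_max + dy)),
    )
```

For a lip box of (100, 80, 180, 120) in a 320 x 240 frame, the ROI becomes (80, 70, 200, 130).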

Adapting the classifier parameters on the fly makes the algorithm more
robust to lighting changes between frames. In addition, the initial
color information extracted from the first image frame may become
unreliable as conditions change. Firstly, the color features obtained
for a person by a camera are influenced by the ambient lighting
conditions and the orientation of the speaker’s face during speech.
Secondly, different cameras produce significantly different color
features even for the same person under the same lighting conditions.
Our work aims to overcome this difficulty by adapting the classifier
parameters on the fly using the information from the previous frame.
The procedure is as follows:

• Extract the color features for the lip class.

• Extract the color features for the non-lip class.

• Update the classifier parameters using the data obtained from the
above two steps.

3. Removing Affine Parameters from Lip Image

In the audio-visual speech and speaker recognition task, both contour
based and pixel based visual features need to be independent of the
affine (rotation, scale, shear and translation) parameters. In order
to use the audio-visual speech and speaker recognizer for natural
speech, the lip image in every frame needs to be pre-processed to
remove the affine parameters before the visual feature extraction
process described in the following sections is applied. A question can
then be posed: do the affine (rotation, scale, shear and translation)
parameters convey linguistic information that could be used for the
recognition task?

3.1.  Lip-Rotation  Problem 

Lip-rotation correction on the fly for natural speaker movement is
essential for robust audio-visual speech and speaker recognition.
Using lip corners or other facial features such as nostrils and eye
corners for rotation correction may be problematic because such facial
features are difficult to locate accurately during natural speech
[9, 12].
Figure 2: Lip rotation correction: a) rotation correction using the
PCA, b) outer lip contour after rotation correction, c) gray lip image
after rotation correction and scaling to 96x64 pixels.


Figure 3: An example of the scaling problem due to the speaker’s
distance to the camera or the physical dimensions of the speaker’s lips.


We propose a principal component analysis (PCA) based rotation
estimation and correction method to overcome the difficulties
mentioned above.

3.1.1. Rotation Correction Using PCA

Principal  component  analysis  (PCA)  is  a method  for  analyz- 
ing multivariate  data  to  identify  a set  of  new  orthogonal  axes 
known  as  principal  components.  The  first  principal  compo- 
nent is  the  axis  that  describes  most  variance  of  the  data,  the 
second  principal  component  is  the  orthogonal  axis  that  de- 
scribes the  second  most  variance  of  the  data,  and  so  on.  PCA 
is  also  called  the  Hotelling  transform  or  Karhunen-Loeve 
expansion  [13]. 

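Let $\mathbf{x} = [x_1\; x_2]^T$ be a 2-dimensional random variable
with mean $\mathbf{m}_x$ and covariance matrix $C$ based on $N$
samples of lip image pixel locations. The mathematical representation
of PCA is as follows.

$$m_{x_k} = \frac{1}{N}\sum_{i=1}^{N} x_{k,i}, \quad k = 1, 2 \tag{15}$$

so

$$\mathbf{m}_x = [m_{x_1}\; m_{x_2}]^T \tag{16}$$

and

$$C = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{x}_i - \mathbf{m}_x)(\mathbf{x}_i - \mathbf{m}_x)^T \tag{17}$$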
where  T represents  the  transpose  operation.  The  task  is  to 
find  the  new  set  of  orthogonal  axes  and  estimate  the  rotation 
angle  with  the  standard  coordinate  system,  and  then  undo 
the  rotation  of  the  lip  pixel  coordinate  data.  Figure  2 shows 
the  rotation  correction  using  the  PCA  coordinate  rotation. 

In order to estimate the rotation angle $\alpha$ between the x-axis
and the u-axis shown in Figure 2a, we solve for the eigenvalues
$\{\lambda_1, \lambda_2\}$ of the covariance matrix $C$ and find the
eigenvector $\mathbf{e}_1$ corresponding to the largest eigenvalue.
The process is as follows:


Figure  4:  An  illustration  of  the  shearing  in  the  horizontal  direc- 
tion. 


The rotation corrected lip image is obtained by multiplying $R^{-1}$
with the coordinates of the lip pixel locations, as

$$\begin{bmatrix} x'_n \\ y'_n \end{bmatrix} = R^{-1} \begin{bmatrix} x_n \\ y_n \end{bmatrix} \tag{22}$$

where $(x_n, y_n)^T$ represents the Cartesian coordinates of the lip
pixel locations, and $(x'_n, y'_n)^T$ represents the Cartesian
coordinates of the lip pixel locations after the rotation correction.
Figure 2c shows the orientation of the lip shape after rotation
correction and scaling of the lip shown in Figure 2a.
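The PCA based rotation estimate and correction can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it assumes standard Cartesian axes (y pointing up) and undoes the rotation by rotating the centered coordinates by $-\alpha$, so the sign convention may differ from the one used in the paper for image coordinates. The function names are hypothetical, and `arctan2` with a sign-normalized eigenvector is used instead of a plain `atan` to avoid division by zero.

```python
import numpy as np

def pca_rotation_angle(points):
    """Angle between the x-axis and the first principal axis.

    points: (N, 2) array of (x, y) lip pixel coordinates.
    """
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    _, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    e1 = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
    if e1[0] < 0:                         # resolve the sign ambiguity
        e1 = -e1
    return np.arctan2(e1[1], e1[0])

def correct_rotation(points):
    """Rotate the points about their mean so the principal axis is horizontal."""
    mean = points.mean(axis=0)
    alpha = pca_rotation_angle(points)
    c, s = np.cos(alpha), np.sin(alpha)
    undo = np.array([[c, s], [-s, c]])    # rotation by -alpha
    return (points - mean) @ undo.T + mean
```

Because the estimate uses all lip pixel locations, it does not depend on locating individual landmarks such as lip corners.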


3.2.  Scaling  Problem 

The scaling problem occurs due to the speaker’s distance to the
camera, the camera zoom factor, and the speaker’s actual lip
dimensions. In this case, any pixel based visual feature extraction
method that uses the frequency content of the lip image, such as the
DCT or the wavelet transform, may generate inconsistent (noisy)
observation vectors. To overcome this problem, we propose to
interpolate every lip image to the same size, N x M. Figure 3 shows an
example of the scaling problem for two different speakers, together
with their lip images after interpolation (scale correction).
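Scale correction by interpolation to a fixed size can be sketched as below. The paper does not state which interpolation method is used, so bilinear interpolation is an assumption here, as are the function name and the row/column argument order (64 rows by 96 columns would match the 96x64 lip images mentioned for Figure 2).

```python
import numpy as np

def resize_bilinear(image, out_rows, out_cols):
    """Resize a 2D gray-level image to (out_rows, out_cols) bilinearly."""
    image = np.asarray(image, dtype=float)
    in_rows, in_cols = image.shape
    # Sample positions in the input image for each output pixel.
    r = np.linspace(0, in_rows - 1, out_rows)
    c = np.linspace(0, in_cols - 1, out_cols)
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, in_rows - 1)
    c1 = np.minimum(c0 + 1, in_cols - 1)
    fr = (r - r0)[:, None]   # fractional row offsets
    fc = (c - c0)[None, :]   # fractional column offsets
    top = image[r0][:, c0] * (1 - fc) + image[r0][:, c1] * fc
    bottom = image[r1][:, c0] * (1 - fc) + image[r1][:, c1] * fc
    return top * (1 - fr) + bottom * fr
```

Every segmented lip image, whatever its original size, then produces a block grid and DCT of identical dimensions.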


$$|C - \lambda I| = 0, \tag{18}$$

and then find the eigenvectors (also called proper vectors or
characteristic vectors), calculated as

$$C\mathbf{e}_i = \lambda_i \mathbf{e}_i, \quad i = 1, 2 \tag{19}$$

where $\mathbf{e}_i = [e_{xi}\; e_{yi}]^T$. The eigenvector belonging
to the largest eigenvalue defines the rotation angle $\alpha$, as

$$\alpha = \mathrm{atan}(e_{y1}/e_{x1}). \tag{20}$$

Then the rotation correction matrix $R^{-1}$ can be written as

$$R^{-1} = \begin{bmatrix} \cos(\alpha) & -\sin(\alpha) \\ \sin(\alpha) & \cos(\alpha) \end{bmatrix} \tag{21}$$

3.3. Shearing (Uneven Scaling) Problem

Shearing occurs when the speaker’s head is not perpendicular to the
camera’s optical axis. For example, one side of the lips may look
larger than the other. Solving the shearing problem using the
information in a single 2D image is not theoretically possible. There
are, however, various practical approaches to minimizing the shearing
effect. For example, the symmetry of the lips may enable us to
estimate the shear matrix by the least squares method and undo the
shearing. Figure 4 illustrates a typical example of a shearing effect
in the horizontal direction.

The shearing may also be associated with the accent of a speaker,
depending on certain visemes. A similar question can then be posed:
does shearing convey linguistic information?

4.  Visual  Speech  Feature  Extraction 

Lipreading clearly meets at least two practicable criteria: it mimics
the human visual perception of speech, and it contains information
that is not always present in the acoustic signal [3,4,14-16]. Petajan
was one of the first researchers to build a lipreading system using
oral-cavity features to improve the performance of an acoustic ASR
system [17]. Silsbee et al. [18] utilized vector quantization (VQ) of
acoustic and visual data for their HMM based audio and video
subsystems. Teissier et al. [19] utilized 20 FFT based 1-bark wide
channels between 0 and 5 kHz for the acoustic features, and inner lip
horizontal width, inner lip vertical height, and inner lip area for
the visual features. Chiou et al. [20] utilized active contour
modeling to extract visual features of geometric space, the
Karhunen-Loeve transform (KLT) to extract principal components in the
color eigenspace, and HMMs to recognize the combined video only
feature sequences. Potamianos et al. [14,21] used Fourier descriptor
magnitudes for a number of Fourier coefficients, width, height, area,
central moments, and normalized moments as contour features, as well
as image transform features and hierarchical discriminant features.

In order to use audio-visual ASR for natural speech in varying
lighting conditions, the visual front end algorithm that extracts the
visual features must satisfy the three criteria presented in Section
1. The contour based features described in Section 4.1 satisfy
criterion (i) in the Fourier domain and are relatively independent of
criteria (ii) and (iii). For pixel based visual feature extraction
methods, criterion (i) is addressed in Section 3. Criteria (ii) and
(iii) are addressed for the 2D DCT based visual features and the
kurtosis measure based visual features in Sections 4.2 and 4.3,
respectively.

4.1.  AI-FDs  Based  Visual  Features 

In general, for the video feature extraction, the relationship between
the observed parametric outer-lip contour data $\mathbf{x}$ and the
parametric reference data $\mathbf{x}^{o}$ can be written as

$$\mathbf{x}[n] = A\,\mathbf{x}^{o}[n + \tau] + \mathbf{b}, \tag{23}$$

where $A$ represents a 2 x 2 arbitrary affine matrix,
$\det(A) \neq 0$, that may include scaling, rotation, and shearing
effects, $\mathbf{b}$ represents a 2 x 1 arbitrary translation vector,
and $\tau$ is the starting point. These are removed in the Fourier
domain [7, 22].

The video feature extraction algorithm extracts twelve
affine-invariant Fourier descriptors (AI-FDs) of the parametric outer
lip contour data as well as four affine-invariant oral cavity
features: width, height, ratio of width to height, and the area
enclosed by the outer lip. The oral cavity features are made
affine-invariant by normalizing them with the corresponding oral
cavity features of the next frame. Dynamic coefficients, which are
used as video observation features, are obtained by differencing the
features of consecutive images in the sequence.

4.2. Normalized 2D DCT Based Visual Features

The Discrete Cosine Transform (DCT) is one of many transform methods
that represent the input as a linear combination of weighted basis
functions. The 2D DCT of an N x N lip image can be written as

$$Y = C^T X C \tag{24}$$

where $X$ is an N x N lip image, $Y$ contains the N x N DCT
coefficients, and $C$ is an N x N transform matrix defined as

$$C_{mn} = k_n \cos\left(\frac{(2m+1)n\pi}{2N}\right), \quad \text{where} \quad k_n = \begin{cases} \sqrt{1/N} & \text{when } n = 0, \\ \sqrt{2/N} & \text{otherwise} \end{cases} \tag{25}$$

and $m, n = 0, 1, \ldots, N-1$. Our goal is to extract, from the N x N
DCT coefficients, visual features that satisfy criteria (ii) and (iii)
and carry the most relevant information about the lip shape. Let
$I^{o}$ and $I$ be lip shape images which differ by scale and shift
factors (lighting condition), i.e.,

$$I = \alpha I^{o} + \delta, \tag{26}$$

where $\alpha$ and $\delta$ are scale and shift factors in the
acceptable range¹ of the chrominance/luminance space.

From Equation 25, we know that the zeroth coefficient of the DCT
contains the DC information ($\delta$ in Equation 26), which does not
convey any shape information. The DCT is also a linear transform, so
the scale factor $\alpha$ simply scales all the DCT coefficients.
Normalizing all the coefficients in the DCT domain by one coefficient
$Y_{mn}$ therefore makes the DCT features scale independent. Then, 35
coefficients from the lower frequencies are selected, excluding the DC
information. Figure 5 shows the normalized 2D DCT based visual feature
extraction process.


[Figure 5 labels: 1 x m observation vector O; subset of 2D DCT
coefficients, where m is the number of (scale and shift invariant)
coefficients.]


Figure  5:  Normalized  2D  DCT  based  visual  feature  extraction. 
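The shift and scale invariance of the normalized DCT features can be checked with a short sketch. The paper does not specify which coefficient $Y_{mn}$ is used for normalization or how the 35 low-frequency coefficients are ordered, so two assumptions are made here: coefficients are ordered by increasing frequency index sum, and the reference coefficient is the selected coefficient with the largest magnitude. The function names are hypothetical.

```python
import numpy as np

def dct_matrix(N):
    """N x N DCT transform matrix C of Eq. 25 (columns are basis vectors)."""
    m = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    k = np.where(n == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return k * np.cos((2 * m + 1) * n * np.pi / (2 * N))

def normalized_dct_features(image, num_coeffs=35):
    """Low-frequency DCT features, DC removed and scale normalized."""
    N = image.shape[0]
    C = dct_matrix(N)
    Y = C.T @ image @ C                                   # Eq. 24
    # Assumed ordering: low frequencies first, by index sum.
    order = sorted(((u, v) for u in range(N) for v in range(N)),
                   key=lambda p: (p[0] + p[1], p[0]))
    order = order[1:num_coeffs + 1]                       # drop the DC term Y(0, 0)
    coeffs = np.array([Y[u, v] for u, v in order])
    # Assumed reference coefficient: the largest-magnitude selected one.
    reference = coeffs[np.argmax(np.abs(coeffs))]
    return coeffs / reference
```

With this normalization, an image $I^{o}$ and its relit version $\alpha I^{o} + \delta$ give the same feature vector, since $\delta$ only affects $Y(0,0)$ and $\alpha$ cancels in the ratio.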


4.3. 2D Kurtosis Measure of the Probability Density Distribution of the DCT Coefficients

After the rotation correction and size normalization of the lip image,
the resulting lip image is divided into 16 x 16 sub-blocks, either
non-overlapping or with 50% overlap, and the two-dimensional DCT of
each block is calculated. For simplicity, let $Y$ be the matrix of
16 x 16 DCT coefficients. $Y(0,0)$ depends only on the
chrominance/luminance space shift shown in Equation 26 and conveys no
shape information. Thus, the $Y(0,0)$ coefficient is removed. The
remaining coefficients now depend only on the chrominance space scale
(see Equation 26). We remove the dependency on the chrominance space
scale by calculating the 2D kurtosis of the frequency profile
(probability distribution of DCT coefficients) of each block in the
lip image, as discussed in the following sections. Figure 6 shows the
pixel based visual front end process, where
$k_0, k_1, \ldots, k_{17}$ are coefficients for the pixel (appearance)
based visual features of the lip image. In this work, we refer to
these pixel based features as frequency profile measures (FPMs), which
are 2D kurtosis measures of the probability density distribution of
the DCT coefficients.

Figure 6: Illustration of FPM visual feature extraction ($k_i$ is an
appearance based visual coefficient for the ith lip image block).

¹Reference and observed lip image contents are clearly visible for a
range of $\alpha$ and $\delta$.

In the theory of probability, the classical measure of the
non-Gaussianity of a random variable is the kurtosis measure. Kurtosis
measures the departure of a probability distribution from the Gaussian
(normal) shape². Kurtosis is a dimensionless ratio and is greater than
zero for most non-Gaussian random variables³. Specifically, for a
given 2D image block function $I(n, m)$, where
$m, n = 0, 1, \ldots, N$, the corresponding 2D DCT coefficients
$Y(x, y)$ can be obtained as described in Section 4.2, where $x$ and
$y$ are the spatial frequencies in the DCT domain. The high-frequency
DCT coefficients⁴ are discarded to minimize the effect of video noise,
which is discussed in Section 4.3.1. The remaining lower frequency DCT
coefficients $Y(x, y)$, for $x, y = 1, 2, \ldots, N/2$, are normalized
to form the bivariate probability density function $p(x, y)$. Using
the notation of [23], for a univariate random variable $x$ with
marginal probability mass function $p(x)$, mean $\mu_x$, and finite
moments up to the fourth moment, the univariate kurtosis is defined
by:

$$\mathrm{kurt}(x) = \beta_2 = \frac{m_4}{m_2^2}, \tag{27}$$

where $m_2$ and $m_4$ are the second and fourth central moments,
respectively. In general, the kth central moment is defined by:

$$m_k = E[(x - \mu_x)^k] = \sum_x (x - \mu_x)^k p(x), \tag{28}$$

where the marginal density function of $x$ is

$$p(x) = \sum_y p(x, y), \tag{29}$$

and $E$ denotes the probability expectation [24]. If $x_1$ and $x_2$
are two independent random variables, then kurtosis has the following
linearity properties:

$$\mathrm{kurt}(x_1 + x_2) = \mathrm{kurt}(x_1) + \mathrm{kurt}(x_2) \quad \text{and} \tag{30}$$

$$\mathrm{kurt}(\alpha x_1) = \alpha^4\, \mathrm{kurt}(x_1) \tag{31}$$

where $\alpha$ is an arbitrary scalar. Clearly, any scale factor in
Equation 27 cancels out.

²The smaller the kurtosis, the flatter the top of the distribution.

³Kurtosis is 3 for any univariate Gaussian distribution.

⁴The upper half of the DCT coefficients are discarded.

Figure 7: In search of the lip region type with 96x64 pixel size to
extract visual speech features: a) exact lip region, b) exact
rectangular lip region, c) extended rectangular lip region.

Let $W$ be a p-dimensional random vector with finite moments up to the
fourth, and let $\mu$ and $\Gamma$ be the mean vector and covariance
matrix of $W$, respectively. Mardia [25] proposed the p-dimensional
multivariate kurtosis as:

02,p  = E[iW-pfT-\W-fi)f,  (32) 

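where $T$ denotes the transpose of a vector. Zhang [23] used the
2D kurtosis of random vectors as a sharpness measure for
Scanning Electron Microscopy (SEM) images. The 2D kurtosis
$\beta_{2,2}$ is calculated by

$$\beta_{2,2} = \frac{\gamma_{4,0} + \gamma_{0,4} + 2\gamma_{2,2} + 4\rho(\rho\gamma_{2,2} - \gamma_{1,3} - \gamma_{3,1})}{(1 - \rho^2)^2}, \tag{33}$$

where

$$\gamma_{k,l} = \frac{\sum_x \sum_y (x - \mu_x)^k (y - \mu_y)^l \, p(x, y)}{\left(\sum_x (x - \mu_x)^2 \, p(x)\right)^{k/2} \left(\sum_y (y - \mu_y)^2 \, p(y)\right)^{l/2}}, \tag{34}$$

$$\sigma_{xy} = E[(x - \mu_x)(y - \mu_y)], \quad \sigma_x^2 = E[(x - \mu_x)^2], \tag{35}$$

and

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}. \tag{36}$$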

The 2D kurtosis measure, $\beta_{2,2}$, is dimensionless and scale
and shift invariant, as seen in Equation 33. In this work,
the 2D kurtosis defined in Equation 33 is calculated using
the probability density distribution of the DCT coefficients
of the image block function $I(n, m)$. We will refer to the
$\beta_{2,2}$ measure as the frequency profile measure (FPM) of an
image block. Image blocks that have zero marginal variance in
$x$ or $y$ are excluded from the $\beta_{2,2}$ calculation, and
their FPMs are arbitrarily assigned the $\gamma_{4,0}$ value when
$\sigma_x \neq 0$ and $\sigma_y = 0$, the $\gamma_{0,4}$ value when
$\sigma_y \neq 0$ and $\sigma_x = 0$, and $-1$ when both
$\sigma_y = 0$ and $\sigma_x = 0$.
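As an illustration of Equations 33-36 and the fallback rules for zero marginal variances, a minimal Python sketch of the FPM computation might look as follows. The function name `frequency_profile_measure`, the 1-based grid coordinates, the `eps` variance threshold, and the error raised when $|\rho| = 1$ are assumptions made for this sketch and are not taken from the paper. The input is any block of non-negative weights, which is normalized to $p(x, y)$ inside the function.

```python
import math


def frequency_profile_measure(weights, eps=1e-12):
    """2D kurtosis (Equation 33) of a block of non-negative weights.

    weights[i][j] is treated as unnormalized probability mass located at
    the grid point (x, y) = (i + 1, j + 1). The eps threshold used to
    decide that a variance is zero is an assumption of this sketch.
    """
    total = sum(sum(row) for row in weights)
    if total <= 0:
        raise ValueError("weights must contain positive mass")
    p = [[w / total for w in row] for row in weights]
    xs = range(1, len(p) + 1)
    ys = range(1, len(p[0]) + 1)

    mu_x = sum(x * p[x - 1][y - 1] for x in xs for y in ys)
    mu_y = sum(y * p[x - 1][y - 1] for x in xs for y in ys)

    def central(k, l):
        # Unnormalized joint central moment E[(x - mu_x)^k (y - mu_y)^l].
        return sum((x - mu_x) ** k * (y - mu_y) ** l * p[x - 1][y - 1]
                   for x in xs for y in ys)

    var_x, var_y = central(2, 0), central(0, 2)

    # Fallback values for degenerate blocks, as described in the text.
    if var_x <= eps and var_y <= eps:
        return -1.0
    if var_y <= eps:
        return central(4, 0) / var_x ** 2  # gamma_{4,0}
    if var_x <= eps:
        return central(0, 4) / var_y ** 2  # gamma_{0,4}

    def gamma(k, l):
        # Normalized moment of Equation 34.
        return central(k, l) / (var_x ** (k / 2) * var_y ** (l / 2))

    rho = central(1, 1) / math.sqrt(var_x * var_y)  # Equation 36
    denom = (1.0 - rho * rho) ** 2
    if denom <= eps:
        # Perfectly correlated mass is not covered by the text.
        raise ValueError("2D kurtosis is undefined for |rho| = 1")

    numerator = (gamma(4, 0) + gamma(0, 4) + 2.0 * gamma(2, 2)
                 + 4.0 * rho * (rho * gamma(2, 2) - gamma(1, 3) - gamma(3, 1)))
    return numerator / denom
```

For a uniform 2x2 block, $\gamma_{4,0} = \gamma_{0,4} = \gamma_{2,2} = 1$ and $\rho = 0$, so the sketch returns 4. Multiplying all weights by a constant leaves the result unchanged, because the weights are normalized before the moments are computed.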

4.3.1. Reducing the Effect of Video Noise in FPM Visual Features

It is known that the low-frequency coefficients in the DCT
of the video signal contain the large details and the
high-frequency coefficients contain the finer details of the
image. Video noise is clearly represented in the DCT
coefficients, and using the full spectrum of the image leads to
noisy (distorted) visual features. That is why some of the
high-frequency DCT coefficients were discarded in the
calculation of the FPM of the image blocks described in Section 4.3.
Pixel based visual front end research requires further
investigation into how to minimize the effects of video noise
and the dependence of the FPM on the selection of the cut-off
frequency.
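The discarding of high-frequency coefficients can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names `dct2` and `low_frequency_pmf`, the orthonormal DCT-II scaling, the use of coefficient magnitudes as probability mass, and the retention of the DC coefficient are all choices made for this example. Only the idea of keeping the lower $N/2 \times N/2$ coefficients and normalizing them comes from the text.

```python
import math


def dct2(block):
    """Naive orthonormal 2D DCT-II of a square block (list of lists)."""
    n = len(block)

    def scale(u):
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

    # basis[u][i] is the u-th 1D DCT-II basis function sampled at i.
    basis = [[scale(u) * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
              for i in range(n)] for u in range(n)]
    return [[sum(basis[u][i] * basis[v][j] * block[i][j]
                 for i in range(n) for j in range(n))
             for v in range(n)] for u in range(n)]


def low_frequency_pmf(block, cutoff=None):
    """Keep the lowest cutoff x cutoff DCT coefficients and normalize them.

    cutoff defaults to N/2, i.e. the upper half of the coefficients in
    each direction is discarded. Using magnitudes as mass is an assumption.
    """
    n = len(block)
    cutoff = n // 2 if cutoff is None else cutoff
    coeffs = dct2(block)
    kept = [[abs(coeffs[u][v]) for v in range(cutoff)] for u in range(cutoff)]
    total = sum(sum(row) for row in kept)
    if total == 0:
        raise ValueError("block has no low-frequency energy")
    return [[c / total for c in row] for row in kept]
```

Varying `cutoff` is one simple way to study how the resulting feature depends on the cut-off frequency, which the text identifies as an open question.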

Footnote: Motion blur, coding artifacts, quantization errors,
electronic noise, etc., are considered to be video noise.


Figure 8: In search of the lip region type with 80x48 pixel size
to extract visual speech features: a) exact lip region, b) exact
rectangular lip region, c) extended rectangular lip region.

Figure 9: Effect of interpolation on pixel based visual feature
extraction: a) re-interpolated from 96x64 pixels to 60x60 pixels,
b) re-interpolated from 80x48 pixels to 60x60 pixels.


Table 1: Visual-only recognition accuracy for the connected digit
task using the subset of the normalized 2D DCT features, FPM
features, and concatenated AI-FDs and FPM features. (LR: lip
region, R-LR: rectangular LR, ER-LR: extended R-LR, bl.: blocks).

Feature configuration                    TR (%)   TS (%)
--------------------------------------------------------
Sub. of norm. 2D DCT using
  exact LR with ini. 80x48 pixels         22.40    21.60
  exact LR with ini. 96x64 pixels         23.00    20.80
  R-LR with ini. 80x48 pixels             24.60    17.20
  R-LR with ini. 96x64 pixels             24.00    19.60
  ER-LR with ini. 80x48 pixels            22.80    24.40
  ER-LR with ini. 96x64 pixels            21.60    21.60
FPMs using
  exact LR with overlapping bl.           41.80    19.60
  exact LR with non-overlapping bl.       35.00    24.00
  R-LR with overlapping bl.               38.80    23.60
  R-LR with non-overlapping bl.           34.60    22.00
  ER-LR with overlapping bl.              39.00    22.00
  ER-LR with non-overlapping bl.          34.20    19.60
Concat. AI-FDs and FPMs using
  only AI-FDs                             18.55    21.33
  exact LR with overlapping bl.           19.20    18.40
  exact LR with non-overlapping bl.       17.60    18.40
  R-LR with overlapping bl.               18.40    20.40
  R-LR with non-overlapping bl.           17.40    18.40
  ER-LR with overlapping bl.              18.40    17.60
  ER-LR with non-overlapping bl.          17.80    18.80

5.  Visual-Only  Experimental  Setup  and 
Results 

This section describes the setup and results of the visual-only
speech recognition (lipreading) system. The HMM states were
modeled with continuous density Gaussians with single mixture
components. The aim of this work is to investigate an
affine and lighting condition invariant visual feature
extraction method. Therefore, the HMM model structure was
kept basic. The HMMs were word-level, left-to-right models
with no skip transitions and ten states (eight emitting and
two non-emitting), with diagonal covariance Gaussian
mixture components, since we assume that the coefficients in
the observation vectors are independent. All the
model parameters were initialized using the Viterbi training
algorithm and re-estimated using the Baum-Welch
re-estimation algorithm. The Viterbi (dynamic programming)
algorithm was used for recognition.
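The word-level topology described above can be sketched as a transition matrix in Python. This is a minimal sketch: the function name `left_to_right_transitions` and the initial self-loop probability of 0.6 are hypothetical values chosen for illustration, and the convention of a non-emitting entry state and a non-emitting exit state with an all-zero row is an assumption about how the two non-emitting states are arranged.

```python
def left_to_right_transitions(num_states=10, self_loop=0.6):
    """Initial transition matrix for a left-to-right HMM without skips.

    State 0 is a non-emitting entry state and the last state is a
    non-emitting exit state; the states in between are emitting.
    The self_loop value is a hypothetical starting point that
    re-estimation would be expected to change.
    """
    if num_states < 3:
        raise ValueError("need an entry state, an exit state, and an emitting state")
    a = [[0.0] * num_states for _ in range(num_states)]
    a[0][1] = 1.0  # the entry state always moves to the first emitting state
    for i in range(1, num_states - 1):
        a[i][i] = self_loop            # stay in the same state
        a[i][i + 1] = 1.0 - self_loop  # advance by exactly one state
    return a  # the exit state keeps an all-zero row


transitions = left_to_right_transitions()
emitting = [i for i in range(len(transitions)) if transitions[i][i] > 0.0]
print(len(emitting))  # 8
```

With ten states the sketch yields eight emitting states, each with only a self-loop and a transition to its immediate successor, which is what "no skip transitions" means here.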

The Clemson University Audio-visual Experimental
(CUAVE) connected and continuous audio-visual digit
database, which is a 36-subject dataset, was used
for the experiments. The visual-only experimental results
are presented for a connected digit recognition
task. The following visual features, extracted from the exact
lip region, exact rectangular lip region, and extended
rectangular lip region as shown in Figures 8 and 9, are used
in the visual-only speech recognition system:

1.  Subset  of  normalized  2D  DCT  features 

2.  FPM  features 

3.  AI-FD  features 

4.  Concatenated  AI-FDs  and  FPM  features 

A subset of the 36-speaker dataset was used, containing 15
speakers, each uttering the digits 0-9 five times. The speakers
were split into a training (TR) set of ten subjects and a
testing (TS) set of five subjects, leading to a
speaker-independent visual-only recognition system. The results
are shown in Table 1.
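The speaker-independent split and the resulting number of digit tokens can be illustrated with a short Python sketch. The speaker identifiers and the choice of the first ten speakers for training are hypothetical; the paper does not state which speakers were assigned to each set.

```python
def split_speakers(speaker_ids, num_train=10):
    """Split speakers into disjoint training and testing lists."""
    train = list(speaker_ids[:num_train])
    test = list(speaker_ids[num_train:])
    if set(train) & set(test):
        raise ValueError("a speaker appears in both sets")
    return train, test


speakers = [f"s{i:02d}" for i in range(1, 16)]  # 15 hypothetical speaker IDs
train_ids, test_ids = split_speakers(speakers)

digits_per_speaker = 5 * 10  # five repetitions of the digits 0-9
print(len(train_ids) * digits_per_speaker,
      len(test_ids) * digits_per_speaker)  # 500 250
```

Because no speaker contributes to both sets, test accuracy reflects generalization to unseen speakers rather than to unseen utterances of known speakers.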

6.  Concluding  Remarks  and  Future  Work 

Table 1 shows the visual-only connected digit recognition
results, where TR corresponds to training set performance
and TS corresponds to test set performance, for the various
visual features discussed in this paper. On the training set,
the subset of the normalized 2D DCT features extracted
from the exact rectangular lip region gives better results than
those from the exact lip region and the extended lip region
(see Figure 9). Another observation is that a slight change in
lip image content due to linear interpolation affects the
system's performance.

In the results obtained using FPM features, the training
set performance is much better than the test set
performance. Similar behavior was observed for
a speaker-dependent recognition task. Therefore, we
conclude that FPM based features are highly sensitive to
video noise. The overlapping block based FPM features
significantly outperformed the non-overlapping block based
FPM features on the training set. Among the three different lip
regions shown in Figure 9, the exact lip region with
overlapping blocks outperforms the other two
regions.

In the results obtained using concatenated AI-FDs and
FPMs, the training set and test set performances are close
to each other and worse than the FPMs-only results. Therefore,
we conclude that each feature should be treated as a separate
stream and weighted properly so that the streams contribute
additional information to one another. Similarly, a slight
performance increase of the overlapping block based FPM
features over the non-overlapping block based FPM features can
be noticed.

We also report that the number of mixtures selected for the
Gaussian mixture model (GMM) and the number of
states in the silence model affect the performance of the
visual-only system. For example, with the number of GMM mixtures
set to twelve and embedded training, the FPM based visual-only
system achieved 98% recognition accuracy on the training set, but
about 16% on the speaker-independent test set (which is lower
than the single-mixture result reported in Table 1).
Similar behavior is observed for the speaker-dependent set.
That is, the system is well trained with the FPM features,
but both test sets behave like an unmatched
system due to the resulting noisy observations.

We conclude that visual noise is an important factor in
visual speech feature extraction, and that overlapping local
image block based FPM features outperform normalized 2D
DCT features, AI-FD features, and concatenated AI-FDs
and FPM features on the training set. Future work will include
initial lip segmentation for the Bayesian framework training and
further study of noise-robust FPM feature extraction.